Batch and Online Spam Filter Comparison

Authors

Gordon V. Cormack

Andrej Bratko

Abstract

In the TREC 2005 Spam Evaluation Track, a number of popular spam filters – all owing their heritage to Graham’s “A Plan for Spam” – did quite well. Machine learning techniques reported elsewhere to perform well were hardly represented in the participating filters, and not represented at all in the better results. A non-traditional technique, Prediction by Partial Matching (PPM), performed exceptionally well, at or near the top of every test. Are the TREC results an anomaly? Is PPM really the best method for spam filtering? How are these results to be reconciled with others showing that methods like Support Vector Machines (SVM) are superior? We address these issues by testing implementations of five different classification methods on the TREC public corpus using the online evaluation methodology introduced in TREC. These results are complemented with cross-validation experiments, which facilitate a comparison of the methods considered in the study under different evaluation schemes, and also give insight into the nature and utility of the evaluation regimens themselves. For comparison with previously published results, we also conducted cross-validation experiments on the Ling-Spam and PU1 datasets. These tests reveal substantial differences attributable to different test assumptions, in particular batch vs. online training and testing, the order of classification, and the method of tokenization. Notwithstanding these differences, the methods that perform well at TREC also perform well using established test methods and corpora. Two previously untested methods – one based on Dynamic Markov Compression and one using logistic regression – compare favorably with competing approaches.
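The distinction the abstract draws between online evaluation and batch cross-validation can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the `TokenCounter` class is a toy word-count scorer and not one of the five methods tested in the study, the six messages are invented, and the function names `online_errors` and `cross_validation_errors` are hypothetical. It only shows how the two regimens differ in what the classifier has seen at prediction time.

```python
from collections import Counter
import math


class TokenCounter:
    """Toy word-count classifier (a stand-in, not a filter from the study)."""

    def __init__(self):
        self.counts = {True: Counter(), False: Counter()}

    def train(self, text, is_spam):
        self.counts[is_spam].update(text.lower().split())

    def predict(self, text):
        # Sum of add-one-smoothed log-odds per token; positive means spam.
        score = 0.0
        for tok in text.lower().split():
            spam_n = self.counts[True][tok] + 1
            ham_n = self.counts[False][tok] + 1
            score += math.log(spam_n / ham_n)
        return score > 0


def online_errors(stream):
    """Online regime: classify each message in order, then learn its label."""
    clf, errors = TokenCounter(), 0
    for text, is_spam in stream:
        errors += clf.predict(text) != is_spam
        clf.train(text, is_spam)
    return errors


def cross_validation_errors(data, k):
    """Batch regime: train on k-1 folds, test on the held-out fold."""
    errors = 0
    for fold in range(k):
        clf = TokenCounter()
        for i, (text, is_spam) in enumerate(data):
            if i % k != fold:
                clf.train(text, is_spam)
        for i, (text, is_spam) in enumerate(data):
            if i % k == fold:
                errors += clf.predict(text) != is_spam
    return errors


# Invented example messages: (text, is_spam)
messages = [
    ("cheap pills buy now", True),
    ("meeting agenda for monday", False),
    ("buy cheap watches now", True),
    ("monday meeting moved to noon", False),
    ("cheap pills cheap watches", True),
    ("agenda for the noon meeting", False),
]

print("online errors:", online_errors(messages))
print("3-fold CV errors:", cross_validation_errors(messages, 3))
```

The online count includes predictions made before any training has happened and depends on message order, whereas each cross-validation fold is scored by a classifier that has already seen the rest of the data, including messages that arrived later.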

Similar resources

Learning and Detecting Concept Drift

The volume of data that humans create has increased explosively as information science and technology have evolved. Therefore, the demand for learning machines that can extract input-output mappings and knowledge rules from massive data sets has become more urgent, and machine learning is now a core technology in the advanced information society. It has been applied to fields such as pattern re...

Feature-based Malicious URL and Attack Type Detection Using Multi-class Classification

Nowadays, malicious URLs are a common threat to businesses, social networks, net-banking, etc. Existing approaches have focused on binary detection, i.e., whether the URL is malicious or benign. Very little literature focuses on the detection of malicious URLs together with their attack types. Hence, it becomes necessary to know the attack type and adopt an effective countermeasure. This pa...

Not So Naive Online Bayesian Spam Filter

Spam filtering, as a key problem in electronic communication, has drawn significant attention due to the increasingly large volume of junk email on the Internet. Content-based filtering is one reliable method for combating spammers’ changing tactics. Naïve Bayes (NB) is one of the earliest content-based machine learning methods, both in theory and in practice, for combating spammers, which is eas...

SpamCooling: A Parallel Heterogeneous Ensemble Spam Filtering System Based on Active Learning Techniques

Anti-spam technology has developed rapidly in recent years. With the emerging applications of machine learning in diverse fields, researchers as well as manufacturers around the world have tried a large number of related algorithms to prevent spam. In this paper, we designed an effective anti-spam protection system, SpamCooling, based on the mechanism of active learning and parallel heterog...

Not So Naïve Online Bayesian Spam Filter

Spam filtering, as a key problem in electronic communication, has drawn significant attention due to the increasingly large volume of junk email on the Internet. Content-based filtering is one reliable method for combating spammers’ changing tactics. Naïve Bayes (NB) is one of the earliest content-based machine learning methods, both in theory and in practice, for combating spammers, which is ea...

Publication date: 2006
